Abstract

This paper proposes a novel kernel approach to linear dimension reduction
for supervised learning. The purpose of the dimension reduction is to find
directions in the input space that explain the output as effectively as
possible. The proposed method uses an estimator for the gradient of the
regression function, based on the covariance operators on reproducing
kernel Hilbert spaces. In comparison with other existing methods, the
proposed one has wide applicability without strong assumptions on the
distributions or the type of variables, and uses a computationally simple
eigendecomposition. Experimental results show that the proposed method
successfully finds the effective directions with efficient computation.

1 Introduction 

Dimension reduction is involved in most modern data analysis, in which
high dimensional data must often be handled. The purposes of dimension
reduction are manifold: preprocessing for another data analysis, less
expensive computation in later processing, or construction of readable low
dimensional expressions. There are two categories of dimension reduction:
unsupervised methods such as PCA, and supervised methods such as Fisher
discriminant analysis (FDA). This paper focuses on dimension reduction for
supervised learning.
Let $(X, Y)$ be a random vector such that $X$ takes values in
$\mathbb{R}^m$. The domain of $Y$ can be arbitrary: continuous, discrete,
or structured. Supervised learning concerns how $Y$ is explained by $X$.
The purpose of dimension reduction in this setting is to find features of
$X$ that explain $Y$ as effectively as possible. This paper focuses on
linear dimension reduction, in which linear combinations of the components
of $X$ are used to make effective features. Although there are many
methods for extracting nonlinear features, including kernel methods, this
paper confines its attention to linear features for the following reasons:
(i) nonlinear feature extraction such as a kernel method depends strongly
on the choice of the nonlinearity (see the Wine data in Sec. 3.2, for
example), whereas linear methods are more stable; (ii) once a linear
dimension reduction method is established, we can apply some nonlinear
transform $\phi(X)$ of $X$ so that linear combinations of $\phi(X)$ give
effective features of $X$.

Beyond the classical approaches such as FDA and CCA, the modern approach
to linear dimension reduction is based on a formulation by conditional
independence. More precisely, we assume

$$ p(Y|X) = p(Y|B^T X) \quad\text{or equivalently}\quad Y \perp\!\!\!\perp X \mid B^T X \tag{1} $$

for the distribution, where $B$ is a projection matrix ($B^T B = I_d$) onto
a $d$-dimensional subspace ($d < m$) in $\mathbb{R}^m$, and wish to
estimate $B$. The subspace spanned by the column vectors of $B$ is called
the effective direction for regression, or EDR space [14]. We consider
methods of estimating $B$ without specific parametric models for $p(y|x)$,
unlike model-based approaches such as [E].

The first method that aims at finding the EDR space is sliced inverse
regression (SIR, [13]), which employs the fact that the inverse regression
$E[X|Y]$ lies in the EDR space under some assumptions. Many methods have
been proposed in this vein of inverse regression ([3j E2]), which use some
statistic in each slice of $Y$. While many inverse regression methods are
computationally simple, they often need strong assumptions on the
distribution of $X$ such as elliptic symmetry, and slice-based methods are
not effective for classification, where the number of slices is at most
the number of classes. Another interesting approach is minimum average
variance estimation (MAVE, [21]), in which the conditional variance of the
regression in the direction of $B^T X$,
$E[(Y - E[Y|B^T X])^2 \mid B^T X]$, is minimized with the conditional
variance estimated by the local linear kernel smoothing method. The kernel
smoothing method requires, however, a careful choice of the bandwidth
parameter, and it is usually difficult to apply if the dimensionality is
very high.
Most relevant to this paper are the methods that use the gradient of the
regressor $\varphi(x) = E[Y|X = x]$ [16, 11]. As explained in Sec. 2.1,
under Eq. (1) the gradient of $\varphi(x)$ is contained in the EDR space.
One can thus estimate the space by nonparametric estimation of the
gradient. There are some limitations in this method, however: the
nonparametric estimation of the gradient in high-dimensional spaces is
challenging, and the directions cannot be estimated when the average
gradient vanishes because of some symmetry in the system (see Sec. 2.3.2).

A kernel method for dimension reduction has been proposed to overcome
various limitations of existing methods. Kernel dimension reduction
(KDR, [3 El I20j ) uses the kernel method to characterize the conditional
independence relation in Eq. (1). While KDR is a general method applicable
to a wide class of problems without requiring any strong assumptions on the
distributions or types of $X$ or $Y$, the optimization needed for the
estimation is computationally demanding: the objective function is
non-convex, and the gradient descent method requires many inversions of
Gram matrices, which prohibits applications to very high-dimensional or
large data.

We propose a novel kernel method for dimension reduction using the
gradient-based approach. Unlike the existing ones [16, 11], the gradient
is estimated by the covariance operators with positive definite kernels,
based on the recent development in the kernel method [HI H7| . It solves
the problems of existing methods: by virtue of the kernel method the
response $Y$ can be of arbitrary type, and the kernel estimation of the
gradient is stable without careful decrease of bandwidth. It also solves
the problem of KDR: the estimator is given by an eigenproblem and needs no
numerical optimization. The method is thus applicable to large and
high-dimensional data, as we demonstrate experimentally.

2 Gradient-based kernel dimension reduction 

In this paper, the range of an operator $A$ is denoted by $\mathcal{R}(A)$.

2.1 Gradient of a regression function and dimension reduction

We first review the basic idea of the gradient-based method for dimension
reduction in supervised learning, which has been used in [16, 11]. Suppose
$Y$ is a real-valued random variable such that the regression function
$E[Y|X = x]$ is differentiable w.r.t. $x$. If the assumption Eq. (1) holds,
we have

$$ \frac{\partial}{\partial x} E[Y|X = x]
   = \frac{\partial}{\partial x} \int y\, p(y|x)\, dy
   = \int y\, \frac{\partial}{\partial x} \tilde{p}(y|B^T x)\, dy
   = B \int y\, \frac{\partial \tilde{p}(y|z)}{\partial z}\Big|_{z = B^T x}\, dy, $$

where $\tilde{p}(y|z)$ denotes the conditional p.d.f. of $Y$ given
$B^T X = z$. This implies that the gradient
$\frac{\partial}{\partial x} E[Y|X = x]$ at any $x$ is contained in the
EDR space. Based on this fact, the average derivative estimates (ADE,
[16]) have been proposed, which use the average of the gradients for
estimating $B$. In the more recent method [11], assuming that $Y$ is a
one-dimensional continuous variable, a standard local linear least squares
with a smoothing kernel (not necessarily a positive definite kernel) [3]
is used for estimating the gradient, and the dimensionality of the
projection is iteratively reduced to the desired one. Since the gradient
estimation for high-dimensional data is difficult in general, the
iterative reduction is expected to give a more accurate estimation. We
call the method in [11] iterative average derivative estimates (IADE).
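The containment of the gradient in the EDR space can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the method itself: the matrix `B`, the toy function `regression`, and the helper `numerical_gradient` are hypothetical names chosen for illustration. The toy function depends on $x$ only through $B^T x$, so its gradient should have no component orthogonal to the span of $B$.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at the point x."""
    grad = np.zeros_like(x)
    for a in range(x.size):
        e = np.zeros_like(x)
        e[a] = h
        grad[a] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Hypothetical setting: m = 5 inputs, d = 2 effective directions.
rng = np.random.default_rng(0)
m, d = 5, 2
B, _ = np.linalg.qr(rng.standard_normal((m, d)))  # orthonormal columns

def regression(x):
    """Toy regression function that depends on x only through z = B^T x."""
    z = B.T @ x
    return np.sin(z[0]) + z[0] * z[1] ** 2

x0 = rng.uniform(-1.0, 1.0, size=m)
g = numerical_gradient(regression, x0)

# Component of the gradient orthogonal to span(B); it should be close to 0.
residual = g - B @ (B.T @ g)
print(np.linalg.norm(g), np.linalg.norm(residual))
```

The residual is zero only up to finite-difference error, which is why the check compares norms rather than testing exact equality.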



2.2 Kernel method for conditional expectation 

It has been recently revealed that the apparatus of positive definite
kernels, or reproducing kernel Hilbert spaces (RKHS), can be applied to
estimate the regression function or conditional expectation with
covariance operators on RKHS [ZHSldT], which we briefly review below. For
a set $\Omega$, an ($\mathbb{R}$-valued) positive definite kernel $k$ on
$\Omega$ is a symmetric kernel $k : \Omega \times \Omega \to \mathbb{R}$
such that $\sum_{i,j=1}^n c_i c_j k(x_i, x_j) \ge 0$ for any
$x_1, \ldots, x_n$ in $\Omega$ and $c_1, \ldots, c_n \in \mathbb{R}$. It is
known that a positive definite kernel on $\Omega$ uniquely defines a
Hilbert space $\mathcal{H}$ consisting of functions on $\Omega$ such that
(i) $k(\cdot, x)$ is in $\mathcal{H}$, (ii) the linear hull of
$\{k(\cdot, x) \mid x \in \Omega\}$ is dense in $\mathcal{H}$, and (iii)
for any $x \in \Omega$ and $f \in \mathcal{H}$,
$\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ (reproducing
property), where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner
product of $\mathcal{H}$. The Hilbert space $\mathcal{H}$ is called the
reproducing kernel Hilbert space (RKHS) associated with $k$.
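As a small illustration of the positive definiteness condition, the sketch below builds the Gram matrix of a Gaussian kernel on random points and checks that the quadratic form $\sum_{i,j} c_i c_j k(x_i, x_j)$ is nonnegative. The helper name `gaussian_gram`, the sample size, and the bandwidth are assumptions made for this example only.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))      # 30 arbitrary points in R^3
G = gaussian_gram(X, sigma=1.0)

c = rng.standard_normal(30)           # arbitrary real coefficients
quad_form = c @ G @ c                 # sum_{i,j} c_i c_j k(x_i, x_j)
min_eig = np.linalg.eigvalsh(G).min() # smallest eigenvalue of the Gram matrix
print(quad_form, min_eig)
```

In floating point the smallest eigenvalue can be a tiny negative number, so any check should allow a small tolerance.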

Let $(\mathcal{X}, \mathcal{B}_X, \mu_X)$ and
$(\mathcal{Y}, \mathcal{B}_Y, \mu_Y)$ be measure spaces, and $(X, Y)$ be a
random variable on $\mathcal{X} \times \mathcal{Y}$ with probability $P$.
We assume that the probability density function (p.d.f.) $p(x, y)$ and the
conditional p.d.f. $p(y|x)$ always exist. We also always assume that a
positive definite kernel is measurable and bounded, where boundedness
means $\sup_{x \in \Omega} k(x, x) < \infty$.

Let $k_X$ and $k_Y$ be positive definite kernels on $\mathcal{X}$ and
$\mathcal{Y}$, respectively, with respective RKHS $\mathcal{H}_X$ and
$\mathcal{H}_Y$. The (uncentered) covariance operator
$C_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ is defined by the equation

$$ \langle g, C_{YX} f \rangle_{\mathcal{H}_Y} = E[f(X) g(Y)]
   = E\big[ \langle f, \Phi_X(X) \rangle_{\mathcal{H}_X}
            \langle \Phi_Y(Y), g \rangle_{\mathcal{H}_Y} \big] \tag{2} $$

for all $f \in \mathcal{H}_X$, $g \in \mathcal{H}_Y$, where
$\Phi_X(x) = k_X(\cdot, x)$ and $\Phi_Y(y) = k_Y(\cdot, y)$. Similarly,
$C_{XX}$ denotes the operator on $\mathcal{H}_X$ that satisfies
$\langle f_2, C_{XX} f_1 \rangle = E[f_2(X) f_1(X)]$ for any
$f_1, f_2 \in \mathcal{H}_X$. These definitions are straightforward
extensions of the ordinary covariance matrices, if we consider the
covariance of the random vectors $\Phi_X(X)$ and $\Phi_Y(Y)$ on RKHS.

By setting $g = k_Y(\cdot, y)$ in Eq. (2), the reproducing property yields

$$ (C_{YX} f)(y) = \int k_Y(y, \tilde{y}) f(\tilde{x})\, dP(\tilde{x}, \tilde{y}),
   \qquad
   (C_{XX} f)(x) = \int k_X(x, \tilde{x}) f(\tilde{x})\, dP_X(\tilde{x}), $$

which shows the explicit expressions of $C_{YX}$ and $C_{XX}$ as integral
operators.

An advantage of the kernel method is that estimation with finite data is
straightforward. Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$
with law $P$, the covariance operator is estimated by

$$ \widehat{C}^{(n)}_{YX} f
   = \frac{1}{n} \sum_{i=1}^n k_Y(\cdot, Y_i) \langle k_X(\cdot, X_i), f \rangle_{\mathcal{H}_X}
   = \frac{1}{n} \sum_{i=1}^n f(X_i)\, k_Y(\cdot, Y_i). \tag{3} $$

The estimator $\widehat{C}^{(n)}_{XX}$ is given similarly. It is known that
these estimators are $\sqrt{n}$-consistent in Hilbert-Schmidt norm [10].

The fundamental result in discussing conditional probabilities with
kernels is the following fact.

Theorem 1 ([7]). If $E[g(Y)|X = \cdot] \in \mathcal{H}_X$ holds for
$g \in \mathcal{H}_Y$, then

$$ C_{XX} E[g(Y)|X = \cdot] = C_{XY} g. $$

If $C_{XX}$ is injective$^1$, the above relation can be expressed as

$$ E[g(Y)|X = \cdot] = C_{XX}^{-1} C_{XY} g. \tag{4} $$

The assumption $E[g(Y)|X = \cdot] \in \mathcal{H}_X$ may not hold in
general; we can easily make counterexamples with Gaussian kernel and
Gaussian distributions. We can nonetheless obtain an empirical estimator
based on Eq. (4), namely,

$$ (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \widehat{C}^{(n)}_{XY} g, $$

where $\varepsilon_n$ is a regularization coefficient in Tikhonov-type
regularization. As we discuss in the Appendix, we can in fact prove
rigorously that this estimator converges to $E[g(Y)|X = \cdot]$.
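In terms of Gram matrices, the regularized estimator evaluated at a point $x$ can be written as $k_X(x)^T (G_X + n\varepsilon_n I)^{-1} (g(Y_1), \ldots, g(Y_n))^T$, where $k_X(x) = (k_X(X_i, x))_{i=1}^n$; this follows by expanding the solution in the functions $k_X(\cdot, X_i)$. The sketch below is an illustration under assumptions, not the authors' code: the function names, the toy data, the bandwidth, and the value of $\varepsilon_n$ are all hypothetical choices, and the data are noise-free so that the true conditional expectation is known.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def conditional_expectation(X, gY, X_new, sigma, eps):
    """Estimate E[g(Y) | X = x] at the rows of X_new.

    gY holds the values g(Y_1), ..., g(Y_n); eps plays the role of eps_n.
    """
    n = X.shape[0]
    Gx = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(Gx + n * eps * np.eye(n), gY)
    return gaussian_kernel(X_new, X, sigma) @ alpha

# Toy data (hypothetical): Y = sin(2X) without noise.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1.0, 1.0, size=(n, 1))
Y = np.sin(2.0 * X[:, 0])

def g(y):
    """g = k_Y(., 0.3) for a Gaussian kernel with bandwidth 0.5."""
    return np.exp(-(y - 0.3) ** 2 / (2.0 * 0.5 ** 2))

X_new = np.linspace(-0.9, 0.9, 7)[:, None]
estimate = conditional_expectation(X, g(Y), X_new, sigma=0.3, eps=1e-4)
truth = g(np.sin(2.0 * X_new[:, 0]))
print(np.max(np.abs(estimate - truth)))
```

The linear system is the same one that appears in kernel ridge regression with targets $g(Y_i)$, which is one way to read the Tikhonov-type regularization.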



1 Noting $\langle C_{XX} f, f \rangle = E[f(X)^2]$, it is easy to see that
$C_{XX}$ is injective if $k_X$ is a continuous kernel on a topological
space $\mathcal{X}$ and $P_X$ is a Borel probability measure such that
$P_X(U) > 0$ for any open set $U$ in $\mathcal{X}$.
To apply the above kernel expressions to the method discussed in Sec. 2.1,
we need a way of taking the derivative of a function. It is known (e.g.,
[19] Sec. 4.3) that if a positive definite kernel $k(x, y)$ on an open set
in Euclidean space is continuously differentiable with respect to $x$ and
$y$, every $f$ in the corresponding RKHS is continuously differentiable.
If further $\frac{\partial}{\partial x} k(\cdot, x) \in \mathcal{H}_k$, we
have

$$ \frac{\partial f}{\partial x}
   = \Big\langle f, \frac{\partial}{\partial x} k(\cdot, x) \Big\rangle_{\mathcal{H}_k}. \tag{5} $$

Namely, the derivative of any function in that RKHS can be computed in the
form of the inner product. This property combined with the above kernel
estimator of $E[g(Y)|X = x]$ provides a method for dimension reduction.
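For a function of the form $f = \sum_i a_i k(\cdot, x_i)$, the derivative property gives $\partial f / \partial x = \sum_i a_i\, \partial k(x_i, x) / \partial x$. The sketch below checks this against finite differences for a Gaussian kernel. The centers, coefficients, and bandwidth are arbitrary values assumed for illustration, and the helper names are hypothetical.

```python
import numpy as np

def k(u, x, sigma):
    """Gaussian kernel value k(u, x)."""
    return np.exp(-np.sum((u - x) ** 2) / (2.0 * sigma ** 2))

def dk_dx(u, x, sigma):
    """Gradient of k(u, x) with respect to its second argument x."""
    return (u - x) / sigma ** 2 * k(u, x, sigma)

rng = np.random.default_rng(0)
centers = rng.standard_normal((6, 3))   # x_1, ..., x_6 in R^3
coef = rng.standard_normal(6)           # a_1, ..., a_6
sigma = 1.2

def f(x):
    """f = sum_i a_i k(., x_i), a finite element of the RKHS."""
    return sum(a * k(u, x, sigma) for a, u in zip(coef, centers))

def grad_f_via_kernel(x):
    """Gradient obtained from the derivative property of the kernel."""
    return sum(a * dk_dx(u, x, sigma) for a, u in zip(coef, centers))

x0 = rng.standard_normal(3)
h = 1e-6
finite_diff = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2.0 * h)
                        for e in np.eye(3)])
print(finite_diff, grad_f_via_kernel(x0))
```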



2.3 Gradient-based kernel method for dimension reduction 
2.3.1 Algorithm 

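Assume that $\mathcal{X} = \mathbb{R}^m$, $C_{XX}$ is injective,
$k_X(\tilde{x}, x)$ is continuously differentiable,
$E[g(Y)|X = x] \in \mathcal{H}_X$ for any $g \in \mathcal{H}_Y$, and
$\frac{\partial}{\partial x} k_X(\cdot, x) \in \mathcal{R}(C_{XX})$. It
follows from Eqs. (4) and (5) that

$$ \frac{\partial}{\partial x} E[g(Y)|X = x]
   = \Big\langle C_{XX}^{-1} C_{XY} g, \frac{\partial k_X(\cdot, x)}{\partial x} \Big\rangle_{\mathcal{H}_X}
   = \Big\langle g, C_{YX} C_{XX}^{-1} \frac{\partial k_X(\cdot, x)}{\partial x} \Big\rangle_{\mathcal{H}_Y}. \tag{6} $$

Define $\Psi : \mathbb{R}^m \to \mathcal{H}_Y$,
$x \mapsto E[k_Y(\cdot, Y)|X = x]$. By plugging $g = k_Y(\cdot, y)$ into
Eq. (6), we see

$$ \frac{\partial \Psi(x)}{\partial x}
   = C_{YX} C_{XX}^{-1} \frac{\partial k_X(\cdot, x)}{\partial x}. $$

On the other hand, from
$\Psi(x) = \int k_Y(\cdot, y)\, p(y|x)\, d\mu_Y(y)$, the same argument as
in Sec. 2.1 shows that $\frac{\partial \Psi(x)}{\partial x} = \Xi(x) B$
with an operator $\Xi(x)$ from $\mathbb{R}^m$ to $\mathcal{H}_Y$, where we
use a slight abuse of notation by identifying the operator with a matrix.
Taking the inner product in $\mathcal{H}_Y$, we have

$$ B^T \langle \Xi(x), \Xi(x) \rangle_{\mathcal{H}_Y} B
   = \Big\langle \frac{\partial k_X(\cdot, x)}{\partial x},
     C_{XX}^{-1} C_{XY} C_{YX} C_{XX}^{-1} \frac{\partial k_X(\cdot, x)}{\partial x} \Big\rangle_{\mathcal{H}_X}
   =: M(x), $$

which shows that the eigenvectors for non-zero eigenvalues of the
$m \times m$ symmetric matrix $M(x)$ are contained in the EDR space. This
fact is the basis of the proposed method. Note that, in comparison with
the conventional gradient-based method described in Sec. 2.1, this method
is interpreted as considering simultaneously various regression functions
$E[k_Y(y, Y)|X = x]$ given by all $y \in \mathcal{Y}$.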
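Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from the true
distribution, based on the empirical covariance operators Eq. (3) and
regularized inversions, the matrix $M(x)$ is estimated by

$$ \widehat{M}_n(x)
   = \Big\langle \frac{\partial k_X(\cdot, x)}{\partial x},
     (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \widehat{C}^{(n)}_{XY} \widehat{C}^{(n)}_{YX}
     (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \frac{\partial k_X(\cdot, x)}{\partial x} \Big\rangle_{\mathcal{H}_X} $$
$$ \phantom{\widehat{M}_n(x)}
   = \nabla k_X(x)^T (G_X + n\varepsilon_n I)^{-1} G_Y (G_X + n\varepsilon_n I)^{-1} \nabla k_X(x), \tag{7} $$

where $G_X$ and $G_Y$ are the Gram matrices $(k_X(X_i, X_j))$ and
$(k_Y(Y_i, Y_j))$, respectively, and
$\nabla k_X(x) = \big( \frac{\partial k_X(X_1, x)}{\partial x}, \ldots,
\frac{\partial k_X(X_n, x)}{\partial x} \big)^T \in \mathbb{R}^{n \times m}$.

As the eigenvectors of $M(x)$ are contained in the EDR space for any $x$,
we propose to use the average of $\widehat{M}_n(X_i)$ over all the data
points $X_i$, and define

$$ \widetilde{M}_n = \frac{1}{n} \sum_{i=1}^n \widehat{M}_n(X_i)
   = \frac{1}{n} \sum_{i=1}^n \nabla k_X(X_i)^T (G_X + n\varepsilon_n I_n)^{-1} G_Y
     (G_X + n\varepsilon_n I_n)^{-1} \nabla k_X(X_i). $$

In the case of the Gaussian kernel, for example, the $j$-th row of
$\nabla k_X(X_i)$ is given by
$\frac{1}{\sigma^2} (X_j - X_i) \exp(-\frac{1}{2\sigma^2} \|X_j - X_i\|^2)$,
which is the Hadamard product between the Gram matrix $G_X$ and
$(X_j - X_i)_{i,j=1}^n$, scaled by $1/\sigma^2$.

The projection matrix $B$ in Eq. (1) is then estimated by the top $d$
eigenvectors of the $m \times m$ symmetric matrix $\widetilde{M}_n$. We
call this method gradient-based kernel dimension reduction (gKDR).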
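The estimator can be written compactly with Gram matrices. The following is a minimal sketch of the procedure described above, assuming Gaussian kernels for both $X$ and $Y$; the function names `gaussian_gram` and `gkdr`, the toy data, and the parameter values are hypothetical choices for illustration, and the straightforward implementation stores an $n \times n \times m$ array, so it is meant for small samples only.

```python
import numpy as np

def gaussian_gram(A, sigma):
    """Gram matrix with entries exp(-||a_i - a_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def gkdr(X, Y, d, sigma_x, sigma_y, eps):
    """Return (B, M): the top-d eigenvectors of the averaged m x m matrix M."""
    n, m = X.shape
    Y = np.asarray(Y, dtype=float).reshape(n, -1)
    Gx = gaussian_gram(X, sigma_x)
    Gy = gaussian_gram(Y, sigma_y)
    A = np.linalg.inv(Gx + n * eps * np.eye(n))   # (G_X + n eps I)^{-1}
    F = A @ Gy @ A
    # D[i, j, :] = derivative of k_X(X_j, x) w.r.t. x, evaluated at x = X_i
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = X_j - X_i
    D = diff * Gx[:, :, None] / sigma_x ** 2
    T = np.tensordot(D, F, axes=([1], [0]))       # T[i, a, k] = sum_j D[i,j,a] F[j,k]
    M = np.einsum('iak,ikb->ab', T, D) / n        # average of M_n(X_i) over i
    M = 0.5 * (M + M.T)                           # remove round-off asymmetry
    w, V = np.linalg.eigh(M)
    B = V[:, np.argsort(w)[::-1][:d]]             # eigenvectors, largest first
    return B, M

# Toy usage (hypothetical data): Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 4))
Y = X[:, 0] + 0.05 * rng.standard_normal(150)
B, M = gkdr(X, Y, d=1, sigma_x=1.5, sigma_y=0.5, eps=1e-3)
print(np.abs(B[:, 0]))
```

In this toy problem the estimated direction should be dominated by its first entry, since the other coordinates carry no information about $Y$. The bandwidths and `eps` are fixed by hand here rather than chosen by cross-validation.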

2.3.2 Discussions and extensions 

The proposed gKDR applies to a wide class of problems. In contrast to many
existing methods, gKDR can handle any type of data for $Y$, including
multivariate or structured variables, and makes no strong assumptions on
the distribution of $X$. The gKDR method can be applied to classification
and continuous output in exactly the same manner.

The previous gradient-based methods ADE and IADE have an obvious
weakness. Suppose $Y$ is one-dimensional and $Y = \varphi(B^T X) + Z$,
where $Z$ is a zero-mean noise. If $E[\varphi'(B^T X)] = 0$, the subspace
spanned by $B$ cannot be estimated. This condition holds if $\varphi$ and
the distribution of $X$ satisfy some symmetry, for instance if $\varphi$
is even and $B^T X$ is symmetrically distributed around zero. These
methods in general find only a subspace of the EDR space. In contrast, the
gKDR approach incorporates various functions $k_Y(y, \cdot)$ for
$\varphi$, as discussed in Sec. 2.3.1, and thus this weakness may be
avoided.
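The degenerate case can be made concrete with a small simulation. The sketch below is an analogy using exact gradients, not the gKDR estimator: it takes the hypothetical choice $\varphi(z) = z^2$ with $X$ uniform on a symmetric cube, so that the average gradient vanishes, while an average of outer products of gradients (the kind of quantity that $M(x)$ aggregates) still reveals the direction. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 3
b = np.array([1.0, 0.0, 0.0])              # true direction (d = 1)
X = rng.uniform(-1.0, 1.0, size=(n, m))    # symmetric distribution
z = X @ b

# Gradient of phi(b^T x) = (b^T x)^2 with respect to x is 2 (b^T x) b.
grads = (2.0 * z)[:, None] * b[None, :]

avg_grad = grads.mean(axis=0)              # what an average-derivative method uses
second_moment = grads.T @ grads / n        # average of outer products of gradients
eigvals, eigvecs = np.linalg.eigh(second_moment)
top_direction = eigvecs[:, -1]             # eigenvector of the largest eigenvalue

print(np.linalg.norm(avg_grad), eigvals[-1], np.abs(top_direction))
```

The average gradient is close to zero even though every individual gradient is parallel to `b`, which is the situation described above for ADE and IADE.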

As in all kernel methods, the results of gKDR depend on the choice of
kernels, though the linear features are less sensitive to the choice than
nonlinear features. We use cross-validation (CV) for choosing kernels and
parameters, combined with some regression or classification method. In
this paper, the k-nearest neighbor (kNN) regression / classification is
used in CV for its simplicity: for each candidate of a kernel or
parameter, we compute the CV error by the kNN method with the input data
projected on the subspace given by gKDR, and choose the one that gives the
least error.
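The selection step can be sketched as follows for classification. This is a simplified illustration under assumptions: the candidate projection matrices are passed in precomputed (in practice each would come from gKDR run with one candidate kernel parameter), the number of folds and the tie-breaking of the vote are arbitrary choices, and the names `knn_predict`, `knn_cv_error`, and `choose_by_cv` are hypothetical.

```python
import numpy as np

def knn_predict(Z_train, y_train, Z_test, k):
    """Majority-vote kNN prediction; labels are nonnegative integers."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(y_train[idx]).argmax() for idx in nearest])

def knn_cv_error(Z, y, k=5, n_folds=5, seed=0):
    """Mean misclassification rate of kNN over a random n_folds partition."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        pred = knn_predict(Z[train], y[train], Z[test_idx], k)
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))

def choose_by_cv(X, y, projections, k=5):
    """Pick the candidate whose projected data give the smallest CV error.

    projections maps a candidate label (e.g. a bandwidth) to an m x d matrix.
    """
    errors = {name: knn_cv_error(X @ B, y, k) for name, B in projections.items()}
    best = min(errors, key=errors.get)
    return best, errors

# Toy usage (hypothetical): the label depends only on the first coordinate.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = (X[:, 0] > 0).astype(int)
candidates = {"first axis": np.eye(3)[:, :1], "second axis": np.eye(3)[:, 1:2]}
best, errors = choose_by_cv(X, y, candidates)
print(best, errors)
```

A stricter variant would re-estimate the projection inside each training fold; the version above keeps the projection fixed to stay short.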

The time complexity of the matrix inversions and the eigendecomposition
required for gKDR is $O(n^3)$, which is prohibitive for large data sets.
We can apply, however, low-rank approximation of Gram matrices, such as
incomplete Cholesky decomposition [5], which is a standard method for
reducing time complexity in kernel methods. The space complexity may also
be a problem of gKDR, since $(\nabla k_X(X_i))_{i=1}^n$ has
$n^2 \times m$ dimension. In the case of the Gaussian kernel, we have a
way of reducing the necessary memory by low rank approximation of the Gram
matrices. Note that
$\frac{\partial}{\partial x^a} k_X(X_j, x)|_{x = X_i}$ for the Gaussian
kernel is given by
$\frac{1}{\sigma^2} (X_j^a - X_i^a) \exp(-\|X_j - X_i\|^2 / (2\sigma^2))$.
Let $G_X \approx R R^T$ and $G_Y \approx H H^T$ be the low rank
approximations with $r_x = \mathrm{rk}\, R$, $r_y = \mathrm{rk}\, H$
($r_x, r_y < n, m$). With the notation
$F := (G_X + n\varepsilon_n I_n)^{-1} H$ and
$\Theta^a_{is} = \frac{1}{\sigma^2} X_i^a R_{is}$, we have

$$ \widetilde{M}_{n,ab} = \frac{1}{n} \sum_{i=1}^n \sum_{t=1}^{r_y} \Gamma^a_{it} \Gamma^b_{it}
   \qquad (1 \le a, b \le m), $$

$$ \Gamma^a_{it}
   = \sum_{j=1}^n \sum_{s=1}^{r_x} \frac{1}{\sigma^2} (X_j^a - X_i^a) R_{js} R_{is} F_{jt}
   = \sum_{s=1}^{r_x} R_{is} \Big( \sum_{j=1}^n \Theta^a_{js} F_{jt} \Big)
   - \sum_{s=1}^{r_x} \Theta^a_{is} \Big( \sum_{j=1}^n R_{js} F_{jt} \Big). $$

With this method, the complexity is $O(nmr)$ in space and $O(nm^2 r)$ in
time ($r = \max\{r_x, r_y\}$), which is much more efficient in memory than
the straightforward implementation.
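The low-rank formula can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a truncated eigendecomposition stands in for the incomplete Cholesky factorization, $F$ is computed with the Woodbury identity applied to $R R^T + \lambda I$ with $\lambda = n\varepsilon_n$, and the names `low_rank_factor` and `gkdr_matrix_lowrank` are hypothetical.

```python
import numpy as np

def gaussian_gram(A, sigma):
    """Gram matrix with entries exp(-||a_i - a_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def low_rank_factor(G, r):
    """Factor L (n x r) with G ~= L L^T from the r largest eigenpairs of G.

    Stand-in for incomplete Cholesky; it needs the full matrix G.
    """
    w, V = np.linalg.eigh(G)
    top = np.argsort(w)[::-1][:r]
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

def gkdr_matrix_lowrank(X, R, H, sigma, lam):
    """Averaged gKDR matrix from factors G_X ~= R R^T and G_Y ~= H H^T.

    lam corresponds to n * eps_n; sigma is the Gaussian bandwidth for X.
    """
    n, m = X.shape
    rx, ry = R.shape[1], H.shape[1]
    # F = (R R^T + lam I)^{-1} H via the Woodbury identity (an r_x x r_x solve).
    F = (H - R @ np.linalg.solve(lam * np.eye(rx) + R.T @ R, R.T @ H)) / lam
    RF = R.T @ F                                   # sum_j R_js F_jt
    Gamma = np.empty((m, n, ry))
    for a in range(m):
        Theta = X[:, [a]] * R / sigma ** 2         # Theta[i, s] = X_i^a R_is / sigma^2
        Gamma[a] = R @ (Theta.T @ F) - Theta @ RF  # Gamma[a][i, t]
    return np.einsum('ait,bit->ab', Gamma, Gamma) / n

# Toy usage (hypothetical data and ranks).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 3))
Y = np.sin(X[:, :1])
R = low_rank_factor(gaussian_gram(X, 1.0), 10)
H = low_rank_factor(gaussian_gram(Y, 0.5), 5)
M_lowrank = gkdr_matrix_lowrank(X, R, H, sigma=1.0, lam=0.1)
print(M_lowrank.shape)
```

Only arrays of size $n \times r$ and $m \times n \times r_y$ are stored, in line with the $O(nmr)$ space bound; with full-rank factors the result coincides with the direct formula up to round-off.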

We introduce two variants of gKDR. First, as discussed in [11], accurate
nonparametric estimation for the derivative of a regression function with
high-dimensional $X$ is not easy in general. We propose a method for
decreasing the dimensionality iteratively in a similar manner to IADE.
Using gKDR, we first find a projection matrix $B_1$ of a larger dimension
$d_1$ than the target dimensionality $d$, project the data $X_i$ onto the
subspace as $Z_i^{(1)} = B_1^T X_i$, and find the projection matrix $B_2$
($d_1 \times d_2$ matrix) for $Z_i^{(1)}$ onto a $d_2$ ($d_2 < d_1$)
dimensional subspace. After repeating this process down to the
dimensionality $d$, the final result is given by
$B = B_1 B_2 \cdots B_\ell$. In this way, we can expect that the later
projectors are more accurate owing to the low dimensionality of the data
$Z_i^{(s)}$. We call this method gKDR-i.
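The iteration can be sketched as a loop that re-runs the basic estimator on the projected data and accumulates the product of the projection matrices. The code below is an illustration under assumptions: it uses Gaussian kernels with the bandwidth set to the median pairwise distance at every stage (a choice made here for brevity, not taken from the text), a fixed regularization value, and hypothetical function names.

```python
import numpy as np

def sq_dists(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def median_sigma(A):
    """Median of the pairwise distances (assumed nonzero)."""
    d2 = sq_dists(A)
    return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))

def gkdr(X, Y, d, eps):
    """Basic gKDR step with Gaussian kernels; returns an m x d matrix."""
    n, m = X.shape
    sx, sy = median_sigma(X), median_sigma(Y)
    Gx = np.exp(-sq_dists(X) / (2.0 * sx ** 2))
    Gy = np.exp(-sq_dists(Y) / (2.0 * sy ** 2))
    A = np.linalg.inv(Gx + n * eps * np.eye(n))
    D = (X[None, :, :] - X[:, None, :]) * Gx[:, :, None] / sx ** 2
    T = np.tensordot(D, A @ Gy @ A, axes=([1], [0]))
    M = np.einsum('iak,ikb->ab', T, D) / n
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return V[:, np.argsort(w)[::-1][:d]]

def gkdr_iterative(X, Y, dims, eps=1e-3):
    """Reduce the dimension step by step; dims = [d_1, d_2, ..., d]."""
    B = np.eye(X.shape[1])
    for d_s in dims:
        Z = X @ B                      # data projected by the current B
        B = B @ gkdr(Z, Y, d_s, eps)   # B_1 B_2 ... B_s
    return B

# Toy usage (hypothetical data): 6 inputs, one effective direction.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(150, 6))
Y = (X[:, 0] + 0.05 * rng.standard_normal(150))[:, None]
B = gkdr_iterative(X, Y, dims=[3, 1])
print(np.abs(B[:, 0]))
```

Because each stage returns a matrix with orthonormal columns, the accumulated product also has orthonormal columns.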

Second, in classification problems, where the $L$ classes are encoded as
$L$ different points, the Gram matrix $G_Y$ is of rank $L$ at most. The
gKDR method can therefore find a subspace of at most $L$ dimensions (see
Eq. (7)), which is a strong limitation of gKDR, especially for binary
classification. Note that this problem is shared by many linear dimension
reduction methods including CCA and slice-based methods. To solve this
problem, we propose to use the variation of $\widehat{M}_n(x)$ over all
points $x = X_i$ instead of the average $\widetilde{M}_n$. We compute the
projection matrix $B_i$ from $\widehat{M}_n(X_i)$ at each $i$, take the
average of the projectors $P = \frac{1}{n} \sum_{i=1}^n B_i B_i^T$, and
give the estimator $B$ by the top eigenvectors of $P$. In practice,
eigendecomposition of $\widehat{M}_n(X_i)$ for all $i$ may not be
feasible. In that case, by partitioning $\{1, \ldots, n\}$ into
$T_1, \ldots, T_\ell$, the projection matrices $B_{[a]}$ given by the
eigenvectors of $M_{[a]} = \sum_{i \in T_a} \widehat{M}_n(X_i)$ can be
used to define $P = \frac{1}{\ell} \sum_{a=1}^{\ell} B_{[a]} B_{[a]}^T$.
We call this method gKDR-v.
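A minimal sketch of this variant is given below, under assumptions: Gaussian kernels are used for both variables, the number of eigenvectors kept at each point is exposed as a parameter `d_local` (the text does not fix it), and the function names and toy data are hypothetical. The version shown uses every data point; the partitioned version would sum the pointwise matrices within each block before the eigendecomposition.

```python
import numpy as np

def gaussian_gram(A, sigma):
    """Gram matrix with entries exp(-||a_i - a_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def pointwise_matrices(X, Y, sigma_x, sigma_y, eps):
    """Stack of the m x m matrices M_n(X_i), i = 1, ..., n."""
    n, m = X.shape
    Gx = gaussian_gram(X, sigma_x)
    Gy = gaussian_gram(Y, sigma_y)
    A = np.linalg.inv(Gx + n * eps * np.eye(n))
    D = (X[None, :, :] - X[:, None, :]) * Gx[:, :, None] / sigma_x ** 2
    T = np.tensordot(D, A @ Gy @ A, axes=([1], [0]))
    Ms = np.einsum('iak,ikb->iab', T, D)
    return 0.5 * (Ms + Ms.transpose(0, 2, 1))

def top_eigvecs(S, d):
    """Eigenvectors of the symmetric matrix S for its d largest eigenvalues."""
    w, V = np.linalg.eigh(S)
    return V[:, np.argsort(w)[::-1][:d]]

def gkdr_v(Ms, d_local, d):
    """Average the local projectors B_i B_i^T and take the top-d eigenvectors."""
    m = Ms.shape[1]
    P = np.zeros((m, m))
    for Mi in Ms:
        Bi = top_eigvecs(Mi, d_local)
        P += Bi @ Bi.T
    P /= len(Ms)
    return top_eigvecs(P, d), P

# Toy usage (hypothetical): binary labels that depend on two coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 5))
labels = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.6).astype(int)
Y = np.eye(2)[labels]                      # classes encoded as (1,0) and (0,1)
Ms = pointwise_matrices(X, Y, sigma_x=1.5, sigma_y=1.0, eps=1e-3)
B, P = gkdr_v(Ms, d_local=1, d=2)
print(B.shape, np.trace(P))
```

Since each local projector has trace `d_local`, the averaged matrix `P` has the same trace, while its rank can exceed `d_local`; this is what allows more directions than the rank of $G_Y$ would permit.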

2.3.3 Theoretical analysis of gKDR 

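Under some conditions, we can obtain the consistency and its rate for
$\widehat{M}_n(x)$ and $\widetilde{M}_n$. We assume all the RKHS are
separable, and $\|M\|_F$ denotes the Frobenius norm of a matrix $M$.

Theorem 2. Assume that
$\frac{\partial k_X(\cdot, x)}{\partial x^a} \in \mathcal{R}(C_{XX}^{\beta+1})$
($a = 1, \ldots, m$) for some $\beta \ge 0$ and
$E[k_Y(y, Y)|X = \cdot] \in \mathcal{H}_X$ for every $y \in \mathcal{Y}$.
Then, for $\varepsilon_n = n^{-\max\{1/3,\, 1/(2\beta+2)\}}$, we have

$$ \widehat{M}_n(x) - M(x) = O_p\big( n^{-\min\{1/3,\, (2\beta+1)/(4\beta+4)\}} \big) $$

for every $x \in \mathcal{X}$ as $n \to \infty$. If further
$E[\|M(X)\|_F^2] < \infty$ and
$\frac{\partial k_X(\cdot, x)}{\partial x^a} = C_{XX}^{\beta+1} h_x^a$
with $E\|h_X^a\|_{\mathcal{H}_X} < \infty$, then
$\widetilde{M}_n \to E[M(X)]$ in the same order as above.

The proof is given in the Appendix. Note that, assuming that the
eigenvalues of $M(x)$ or $E[M(X)]$ are all distinct, the convergence of
matrices implies the convergence of the eigenvectors; thus the estimator
of gKDR is consistent to the subspace given by the top eigenvectors of
$E[M(X)]$.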

3 Experimental results

We always use the Gaussian kernel
$k(x, \tilde{x}) = \exp(-\frac{1}{2\sigma^2} \|x - \tilde{x}\|^2)$ in the
kernel method below.

3.1 Synthesized data

First we use two types of synthesized data, which have been used in [11],
to verify the basic performance of gKDR and the two variants. The data are
generated by

$$ \text{(A)}: \quad Y = Z \sin(\sqrt{5}\, Z) + W, \qquad
   Z = \tfrac{1}{\sqrt{5}} (1, 2, 0, \ldots, 0)^T X, $$

$$ \text{(B)}: \quad Y = (Z_1^3 + Z_2)(Z_1 - Z_2^3) + W, \qquad
   Z_1 = \tfrac{1}{\sqrt{2}} (1, 1, 0, \ldots, 0)^T X, \quad
   Z_2 = \tfrac{1}{\sqrt{2}} (1, -1, 0, \ldots, 0)^T X, $$

where the 10-dimensional $X$ is generated by the uniform distribution on
$[-1, 1]^{10}$ and $W$ is independent Gaussian noise with zero mean and
variance $10^{-2}$. The sample size is $n = 100$ and $200$. The
discrepancy between the estimator $B$ and the true projector $B_0$ is
measured by $\|B_0 B_0^T (I_m - B B^T)\|_F / d$, where $\|\cdot\|_F$ is
the Frobenius norm. For choosing the parameter $\sigma$ in the Gaussian
kernel, CV with kNN ($k = 5$) is used with 8 points given by
$c\, \sigma_{med}$ ($0.5 \le c \le 10$), where $\sigma_{med}$ is the
median of pairwise distances of data [9] (the same strategy is used for CV
in all the experiments below). The regularization parameter is fixed as
$\varepsilon_n = 10^{-7}$.

              gKDR             gKDR-i           gKDR-v           gKDR+KDR         IADE [11]
(A) n = 100   0.2114 (0.0636)  0.1905 (0.0495)  0.2101 (0.0704)  0.0883 (0.1473)  0.0903
(A) n = 200   0.1393 (0.0362)  0.1217 (0.0352)  0.1356 (0.0351)  0.0501 (0.0964)  0.0537
(B) n = 100   0.1500 (0.0363)  0.1358 (0.0347)  0.1630 (0.0398)  0.1076 (0.0967)  0.182
(B) n = 200   0.0755 (0.0157)  0.0750 (0.0153)  0.0802 (0.0160)  0.0506 (0.0729)  0.0472

Table 1: Synthesized data. Mean and standard error (in brackets) over 100
samples. The mean errors of IADE are taken from [11].
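The discrepancy measure and the bandwidth candidates used in this section can be sketched as follows. The function names are hypothetical, and the logarithmic spacing of the 8 multipliers between 0.5 and 10 is an assumption made here, since only the range and the number of points are stated.

```python
import numpy as np

def edr_discrepancy(B_est, B_true):
    """|| B0 B0^T (I_m - B B^T) ||_F / d for orthonormal m x d matrices."""
    m, d = B_true.shape
    P_true = B_true @ B_true.T
    P_est = B_est @ B_est.T
    return np.linalg.norm(P_true @ (np.eye(m) - P_est), 'fro') / d

def median_pairwise_distance(X):
    """Median of the Euclidean distances between distinct rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return float(np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)])))

def sigma_candidates(X, c_min=0.5, c_max=10.0, num=8):
    """Candidate bandwidths c * sigma_med with c log-spaced in [c_min, c_max]."""
    return np.geomspace(c_min, c_max, num) * median_pairwise_distance(X)

# Toy usage: a perfect estimate and an orthogonal one.
B_true = np.eye(4)[:, :1]
print(edr_discrepancy(B_true, B_true), edr_discrepancy(np.eye(4)[:, 1:2], B_true))
```

The measure is zero exactly when the true subspace is contained in the estimated one, and it does not depend on the choice of basis within either subspace.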

We compare the results only with IADE, since [11] reports that the results
of IADE are much better than those of SIR and pHd. From Table 1 we see
that gKDR, gKDR-i (5 iterations), and gKDR-v show results comparable to
IADE for data (B), while IADE works better for data (A). For data (B),
when the sample size is 100, the proposed gKDR methods show better results
than IADE (0.1358 to 0.1630 against 0.182). gKDR and gKDR-v show similar
errors, and gKDR-i improves on gKDR in all four cases. We also use the
results of gKDR as the initial state for KDR, which requires non-convex
optimization with a gradient method. As we can see from the table, KDR
improves the accuracy significantly, showing results better than or
comparable to IADE. The optimization in KDR, however, sometimes fails to
find a good solution, which causes the large variance in the experiments.
                Dim.   Train   Test
heart-disease   13     149     148
ionosphere      34     151     200
breast-cancer   30     200     369

Table 2: Summary of data sets: dimensionality of X and the number of data

3.2 Real world data 

We first use Wine data, which is available at the UCI machine learning
repository [6], to demonstrate low dimensional visualization. In this data
set, $X$ is a 13 dimensional continuous variable, and $Y$ is the class
label representing three classes of wine, which is encoded as
$\{(1,0,0), (0,1,0), (0,0,1)\}$. The sample size is 178. Two dimensional
projections are estimated by gKDR and KDR. For gKDR, the parameter
$\sigma$ in the Gaussian kernel is chosen by CV with kNN ($k = 5$). As
shown in Figure 1, the results by KDR and gKDR look similar, while each of
the classes by KDR is more condensed. With Intel(R) Core(TM) i7 960,
3.20GHz, the computational time required for one parameter set was 0.14
sec by gKDR and 4.80 sec by KDR with 50 iterations of line search: gKDR is
more than 30 times faster than KDR for this data set. As comparison, we
show also the results by kernel CCA (KCCA) [HE]. Since the nonlinear
mapping in KCCA easily separates the three classes with small $\sigma$,
cross-validation is unstable and inapplicable. The results given by the
three values of $\sigma$ are very different for KCCA.

One way of evaluating dimension reduction methods in supervised learning
is to consider the classification or regression accuracy after projecting
data onto the estimated subspaces. We next use three data sets for binary
classification, heart-disease, ionosphere, and breast-cancer-Wisconsin,
from the UCI repository (see Table 2), and compare the classification
errors with gKDR-v and KDR.

The classification rates with kNN classifiers ($k = 7$) for projected data
are shown in Fig. 2. We can see that the classification ability of the
subspaces estimated by gKDR-v is competitive with that of KDR: slightly
worse in Ionosphere, and slightly better in Breast-cancer-Wisconsin. The
computation of gKDR-v for these data sets can be hundreds or thousands of
times faster than that of KDR. For each parameter set, the computational
time of gKDR vs KDR was 0.044 sec / 622 sec (d = 20) in Heart-disease,
0.103 sec / 84.77 sec (d = 20) in Ionosphere, and 0.116 sec / 615 sec
(d = 11) in Breast-cancer-Wisconsin.

The next two data sets are larger in sample size and dimensionality, and
the optimization of KDR is difficult to apply to them. The first one
consists of 2007 images of the USPS handwritten digit data set used in
[18], where 256 gray scale pixels are provided as $X$ for each image.
First we make a three dimensional plot for the subset of 500 images with
classes "1" through "5", in a similar way to [20]. The result is shown in
Fig. 3. We can see that, although this is a linear projection, the
subspace found by gKDR separates the five classes reasonably well.

Figure 1: Two dimensional plots of Wine data by gKDR, KDR, and Kernel
CCA. Panels: (a) gKDR, (b) KDR, (c) KCCA ($\sigma$ = 2*MedD), (d) KCCA
($\sigma$ = 10*MedD), (e) KCCA ($\sigma$ = 100*MedD). MedD means the
median of pairwise distances among $X_i$ [9].

Figure 2: Classification accuracy with gKDR-v and KDR for binary
classification problems, plotted against dimensionality. Panels: (a) Heart
Disease, (b) Ionosphere, (c) Breast-cancer-Wisconsin.

We evaluate the classification errors by the simple kNN classifier
($k = 5$) with the data projected onto the estimated subspaces, using 1000
images for training and the rest for testing. We compare gKDR with CCA as
a baseline. Table 3 shows that the subspaces found by gKDR (-i, -v) have
much better classification ability than those given by CCA. As in the
previous cases, gKDR and gKDR-v show similar errors, and gKDR-i (5
iterations) improves them slightly.

Figure 3: Three dimensional plots of USPS data (5 classes) from two
different angles.

Dim.      3      5      7      9      15     20     25
gKDR      56.82  27.96  19.00  16.66  -      -      -
gKDR-i    39.81  26.17  18.62  15.06  -      -      -
gKDR-v    47.78  25.89  18.62  15.92  12.43  11.73  12.67
CCA       51.05  32.62  23.96  24.49  -      -      -

Table 3: USPS2007: classification errors for test data (percentage)

The second large data set is ISOLET, taken from the UCI repository [6].
The data set provides 617 dimensional continuous features of speech
signals for each of the 26 letters of the alphabet. In addition to 6238
training data, 1559 test data are separately provided. We evaluate the
classification errors with the kNN classifier ($k = 5$) to see the
effectiveness of the estimated subspaces. Table 4 shows the error rates of
classification for the test data after dimension reduction. To save
computational time, we did not use gKDR-i. From the information on the
data at the UCI repository, the best performances with neural networks and
C4.5 with ECOC are 3.27% and 6.61%, respectively. In comparison with these
results, we can see that the simple kNN classification shows competitive
performance on the low dimensional subspaces found by gKDR and gKDR-v.

4 Concluding remarks 

We have proposed a method for gradient-based kernel dimension reduction
and its two variants, which provide a general approach to dimension
reduction in supervised learning; they have wide applicability with
little restriction on the distribution or type of the variables, and the
computation is done with simple linear algebra.



Dim.      5      10     15     20     25     30     35     40     45     50
gKDR      30.21  13.53  7.70   4.55   4.23   -      -      -      -      -
gKDR-v    29.44  13.15  8.28   4.55   3.91   4.81   5.26   5.26   5.77   5.58
CCA       22.77  15.78  8.72   6.74   7.18   -      -      -      -      -

Table 4: ISOLET: classification errors for test data (percentage)
As discussed in Sec. 2.3.2, gKDR may solve the problem of the existing
gradient methods that they do not work if the regression function has a
degenerate average derivative. It is then an interesting theoretical
question whether gKDR can find the true EDR space. This is left for future
work.

This paper focuses only on the supervised setting, but it may be possible
to extend the proposed method to the unsupervised case in a way similar to
that employed in [20]. Extension to nonlinear feature extraction is also
important in some practical problems. As we discussed in the Introduction,
applying a nonlinear transform gives a straightforward extension. Another
interesting question is how we can "kernelize" gKDR to replace the linear
features with nonlinear ones. This is not as straightforward as in many
other kernel methods, since differentiation with respect to the feature
map is involved. This is also among our future directions.

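A Consistency of the kernel estimator for the regression function

We discuss the consistency of the estimator
$(\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \widehat{C}^{(n)}_{XY} g$
for $E[g(Y)|X = \cdot]$. While this consistency has already been proved in
some literature such as [25], [26], [23], [2^] in various contexts, we
show the proof in our terminology for completeness.

Theorem 3. Let $g \in \mathcal{H}_Y$ and assume that
$E[g(Y)|X = \cdot] \in \mathcal{R}(C_{XX}^{\nu})$ for $\nu \ge 0$, where
$\mathcal{R}(C_{XX}^{\nu})$ for $\nu = 0$ is interpreted as
$\mathcal{H}_X$. If $\varepsilon_n \to 0$ ($n \to \infty$), then

$$ \big\| (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \widehat{C}^{(n)}_{XY} g
   - E[g(Y)|X = \cdot] \big\|_{\mathcal{H}_X} $$

is of the order

$$ O_p(\varepsilon_n^{-1} n^{-1/2}) + O(\varepsilon_n^{\nu}) \quad \text{for } 0 \le \nu < 1, $$
$$ O_p(\varepsilon_n^{-1} n^{-1/2}) + O(\varepsilon_n) \quad \text{for } \nu \ge 1. $$

Consequently, if $\varepsilon_n = n^{-\max\{1/4,\, 1/(2\nu+2)\}}$, then
the estimator is consistent of the order
$O_p(n^{-\min\{1/4,\, \nu/(2\nu+2)\}})$.

Proof. Take $\eta \in \mathcal{H}_X$ such that
$E[g(Y)|X = \cdot] = C_{XX}^{\nu} \eta$. From Theorem 1 we have
$C_{XY} g = C_{XX} E[g(Y)|X = \cdot] = C_{XX}^{\nu+1} \eta$.
First, we show

$$ \big\| (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \widehat{C}^{(n)}_{XY} g
   - (C_{XX} + \varepsilon_n I)^{-1} C_{XY} g \big\|_{\mathcal{H}_X}
   = O_p(\varepsilon_n^{-1} n^{-1/2}) \qquad (n \to \infty). \tag{8} $$

Since $B^{-1} - A^{-1} = B^{-1} (A - B) A^{-1}$ for any invertible
operators $A$ and $B$, the left hand side is upper bounded by

$$ \big\| (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} (C_{XX} - \widehat{C}^{(n)}_{XX})
   (C_{XX} + \varepsilon_n I)^{-1} C_{XY} g \big\|_{\mathcal{H}_X}
   + \big\| (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1}
   (\widehat{C}^{(n)}_{XY} - C_{XY}) g \big\|_{\mathcal{H}_X}. $$

From $C_{XY} g = C_{XX}^{\nu+1} \eta$, we have
$\| (C_{XX} + \varepsilon_n I)^{-1} C_{XY} g \| \le \| C_{XX}^{\nu} \eta \|_{\mathcal{H}_X}$.
Combination of this fact with
$\| (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \| \le \varepsilon_n^{-1}$
and $\| \widehat{C}^{(n)}_{XX} - C_{XX} \| = O_p(n^{-1/2})$ proves that the
first term is of the order $O_p(\varepsilon_n^{-1} n^{-1/2})$. The second
term is of the same order from
$\| \widehat{C}^{(n)}_{XY} - C_{XY} \| = O_p(n^{-1/2})$, which implies
Eq. (8).

Next, we derive the upper bounds

$$ \big\| (C_{XX} + \varepsilon_n I)^{-1} C_{XY} g - E[g(Y)|X = \cdot] \big\|_{\mathcal{H}_X}
   = \begin{cases} O(\varepsilon_n^{\nu}) & \text{for } 0 \le \nu < 1, \\
                   O(\varepsilon_n) & \text{for } \nu \ge 1. \end{cases} \tag{9} $$

It follows from $E[g(Y)|X = \cdot] = C_{XX}^{\nu} \eta$ and
$C_{XY} g = C_{XX}^{\nu+1} \eta$ that

$$ (C_{XX} + \varepsilon_n I)^{-1} C_{XY} g - E[g(Y)|X = \cdot]
   = (C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\nu+1} \eta - C_{XX}^{\nu} \eta. $$

Let $C_{XX} = \sum_i \lambda_i \langle \phi_i, \cdot \rangle \phi_i$ be the
eigendecomposition of $C_{XX}$ such that $\lambda_i > 0$ are the
eigenvalues and $\phi_i$ are the orthonormal eigenvectors. The
eigenspectrum of the operator
$(C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\nu+1} - C_{XX}^{\nu}$ is then
given by

$$ \frac{\lambda_i^{\nu+1}}{\lambda_i + \varepsilon_n} - \lambda_i^{\nu}
   = - \frac{\varepsilon_n \lambda_i^{\nu}}{\lambda_i + \varepsilon_n}
   \qquad (i = 1, 2, \ldots). $$

If $0 \le \nu < 1$, from
$\frac{\varepsilon_n \lambda_i^{\nu}}{\lambda_i + \varepsilon_n}
 = \varepsilon_n^{\nu} \big( \frac{\lambda_i}{\lambda_i + \varepsilon_n} \big)^{\nu}
   \big( \frac{\varepsilon_n}{\lambda_i + \varepsilon_n} \big)^{1-\nu}$,
$\big| \frac{\lambda_i}{\lambda_i + \varepsilon_n} \big| \le 1$, and
$\big| \frac{\varepsilon_n}{\lambda_i + \varepsilon_n} \big| \le 1$, we
have

$$ \| (C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\nu+1} - C_{XX}^{\nu} \| \le \varepsilon_n^{\nu}. $$

If $\nu \ge 1$, then
$\frac{\varepsilon_n \lambda_i^{\nu}}{\lambda_i + \varepsilon_n} \le \varepsilon_n \lambda_i^{\nu-1}$.
It follows that

$$ \| (C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\nu+1} - C_{XX}^{\nu} \|
   \le \varepsilon_n \| C_{XX} \|^{\nu-1}. $$

From Eqs. (8) and (9), the proof is completed. □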

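B Proof of Theorem 2

Let $g_a = \frac{\partial k_X(\cdot, x)}{\partial x^a}$. Since

$$ M_{ab}(x) = \big\langle \langle E[k_Y(*, Y)|X = \cdot], g_a \rangle_{\mathcal{H}_X},
   \langle E[k_Y(*, Y)|X = \cdot], g_b \rangle_{\mathcal{H}_X} \big\rangle_{\mathcal{H}_Y}
   = \big\langle E[k_Y(*, Y)|g_a(X)], E[k_Y(*, Y)|g_b(X)] \big\rangle_{\mathcal{H}_Y} $$

and

$$ \widehat{M}_{n,ab}(x) = \big\langle \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_a,
   \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_b \big\rangle_{\mathcal{H}_Y}, $$

we have

$$ | \widehat{M}_{n,ab}(x) - M_{ab}(x) |
   \le \big| \big\langle \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_a,
       \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_b
       - E[k_Y(*, Y)|g_b(X)] \big\rangle_{\mathcal{H}_Y} \big| $$
$$ \qquad + \big| \big\langle \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_a
       - E[k_Y(*, Y)|g_a(X)], E[k_Y(*, Y)|g_b(X)] \big\rangle_{\mathcal{H}_Y} \big|. $$

Noting $\varepsilon_n \sqrt{n} \to \infty$ and the expression

$$ (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1}
   = (C_{XX} + \varepsilon_n I)^{-1}
     \big( I - (C_{XX} - \widehat{C}^{(n)}_{XX}) (C_{XX} + \varepsilon_n I)^{-1} \big)^{-1}, $$

Lemma 4 in [26] shows that

$$ \| C_{XX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \| = O_p(1). $$

From $g_a = C_{XX}^{\beta+1} \eta$ for some $\eta \in \mathcal{H}_X$, we
have
$\| \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g_a \| = O_p(1)$.
For the proof of the first assertion of Theorem 2 it is then sufficient to
prove the following theorem.

Theorem 4. Assume that $g \in \mathcal{H}_X$ satisfies
$g \in \mathcal{R}(C_{XX}^{\beta+1})$ for some $\beta \ge 0$ and that
$E[k_Y(y, Y)|X = \cdot] \in \mathcal{H}_X$ for every
$y \in \mathcal{Y}$. Then, for $\varepsilon_n > 0$ with
$\varepsilon_n = n^{-\max\{1/3,\, 1/(2\beta+2)\}}$, we have

$$ \big\| \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g
   - E[k_Y(\cdot, Y)|g(X)] \big\|_{\mathcal{H}_Y}
   = O_p\big( n^{-\min\{1/3,\, (2\beta+1)/(4\beta+4)\}} \big) $$

as $n \to \infty$.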

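Proof. It suffices to show

$$ \big\| \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} g
   - C_{YX} (C_{XX} + \varepsilon_n I)^{-1} g \big\|_{\mathcal{H}_Y}
   = O_p(\varepsilon_n^{-1/2} n^{-1/2}) \tag{10} $$

and

$$ \big\| C_{YX} (C_{XX} + \varepsilon_n I)^{-1} g - E[k_Y(\cdot, Y)|g(X)] \big\|_{\mathcal{H}_Y}
   = O\big( \varepsilon_n^{\min\{1,\, (2\beta+1)/2\}} \big) \tag{11} $$

as $n \to \infty$. In fact, optimizing the rate derives the assertion of
the theorem.

Let $g = C_{XX}^{\beta+1} h$, where $h \in \mathcal{H}_X$. Since
$B^{-1} - A^{-1} = B^{-1} (A - B) A^{-1}$ for any invertible operators $A$
and $B$, the left hand side of Eq. (10) is upper bounded by

$$ \big\| \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1}
   (C_{XX} - \widehat{C}^{(n)}_{XX}) (C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\beta+1} h \big\|_{\mathcal{H}_Y}
   + \big\| (\widehat{C}^{(n)}_{YX} - C_{YX}) (C_{XX} + \varepsilon_n I)^{-1} C_{XX}^{\beta+1} h \big\|_{\mathcal{H}_Y}. $$

By the decomposition
$\widehat{C}^{(n)}_{YX} = \widehat{C}^{(n)1/2}_{YY} \widehat{W}^{(n)}_{YX} \widehat{C}^{(n)1/2}_{XX}$
with $\| \widehat{W}^{(n)}_{YX} \| \le 1$, we have
$\| \widehat{C}^{(n)}_{YX} (\widehat{C}^{(n)}_{XX} + \varepsilon_n I)^{-1} \| = O(\varepsilon_n^{-1/2})$.
It is known that $\| C_{XX} - \widehat{C}^{(n)}_{XX} \| = O_p(n^{-1/2})$.
From these two facts, we see that the first term is of
$O_p(\varepsilon_n^{-1/2} n^{-1/2})$. Since the second term is of
$O_p(n^{-1/2})$, Eq. (10) is obtained.

For Eq. (11), first note that for each $y$

$$ E[k_Y(y, Y)|g(X)] = \langle E[k_Y(y, Y)|X = \cdot], g \rangle
   = \langle E[k_Y(y, Y)|X = \cdot], C_{XX}^{\beta+1} h \rangle $$
$$ \qquad = \langle C_{XX} E[k_Y(y, Y)|X = \cdot], C_{XX}^{\beta} h \rangle
   = \langle C_{XY} k_Y(y, \cdot), C_{XX}^{\beta} h \rangle $$
$$ \qquad = \langle k_Y(y, \cdot), C_{YX} C_{XX}^{\beta} h \rangle
   = (C_{YX} C_{XX}^{\beta} h)(y), $$

which means $E[k_Y(\cdot, Y)|g(X)] = C_{YX} C_{XX}^{\beta} h$. Let
$C_{YX} = C_{YY}^{1/2} W_{YX} C_{XX}^{1/2}$ be the decomposition with
$\| W_{YX} \| \le 1$. Then, we have

$$ \big\| C_{YX} (C_{XX} + \varepsilon_n I)^{-1} g - E[k_Y(\cdot, Y)|g(X)] \big\|_{\mathcal{H}_Y}
   \le \| C_{YY}^{1/2} W_{YX} \| \,
       \big\| C_{XX}^{(2\beta+3)/2} (C_{XX} + \varepsilon_n I)^{-1} h - C_{XX}^{(2\beta+1)/2} h \big\|_{\mathcal{H}_X}. $$

Let $\{\phi_i\}$ be the unit eigenvectors of $C_{XX}$ such that
$C_{XX} f = \sum_i \lambda_i \langle \phi_i, f \rangle \phi_i$. Then the
eigenspectrum of
$C_{XX}^{(2\beta+3)/2} (C_{XX} + \varepsilon_n I)^{-1} - C_{XX}^{(2\beta+1)/2}$
is given by

$$ - \frac{\varepsilon_n \lambda_i^{(2\beta+1)/2}}{\lambda_i + \varepsilon_n}
   \qquad (i = 1, 2, \ldots). $$

If $0 \le \beta < 1/2$, we have
$\frac{\varepsilon_n \lambda_i^{(2\beta+1)/2}}{\lambda_i + \varepsilon_n}
 = \varepsilon_n^{(2\beta+1)/2}
   \big( \frac{\lambda_i}{\lambda_i + \varepsilon_n} \big)^{(2\beta+1)/2}
   \big( \frac{\varepsilon_n}{\lambda_i + \varepsilon_n} \big)^{(1-2\beta)/2}
 \le \varepsilon_n^{(2\beta+1)/2}$.
If $\beta \ge 1/2$, then
$\frac{\varepsilon_n \lambda_i^{(2\beta+1)/2}}{\lambda_i + \varepsilon_n}
 \le \lambda_i^{\beta - 1/2} \varepsilon_n$.
We have thus Eq. (11), which completes the proof of Theorem 4.

□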
For the second assertion of Theorem 2, note

$$ \Big\| \frac{1}{n} \sum_{i=1}^n \widehat{M}_n(X_i) - E[M(X)] \Big\|_F
   \le \Big\| \frac{1}{n} \sum_{i=1}^n \widehat{M}_n(X_i) - \frac{1}{n} \sum_{i=1}^n M(X_i) \Big\|_F
   + \Big\| \frac{1}{n} \sum_{i=1}^n M(X_i) - E[M(X)] \Big\|_F. $$

The second term in the right hand side is of $O_p(n^{-1/2})$ by the
central limit theorem. By replacing $h$ by
$\frac{1}{n} \sum_{i=1}^n h_{X_i}^a$ in the proof of Theorem 4, the
assertion is obtained as a corollary.